
24 research outputs found

A Study of a Non-Resourced Language: The Case of one of the Algerian Dialects

Author: Bouchemal Najette
Meftouh Karima
Smaïli Kamel
Publication venue: HAL CCSD
Publication date: 07/05/2012

This paper presents a linguistic study of an Algerian Arabic dialect, namely the dialect of Annaba (AD). It also presents the methodology applied to construct a parallel MSA-AD corpus. This work is carried out with the future goal of developing a machine translation system from Modern Standard Arabic (MSA) to Algerian Arabic dialects.

INRIA a CCSD electronic archive server

Hal-Diderot

Creating Parallel Arabic Dialect Corpus: Pitfalls to Avoid

Author: Harrat Salima
Meftouh Karima
Smaïli Kamel
Publication venue: HAL CCSD
Publication date: 17/04/2017

Creating parallel corpora is a difficult problem that many researchers try to address. For under-resourced languages such as Arabic dialects, the problem is more complicated because of the nature of these spoken languages. In this paper, we share our experience of creating a parallel corpus that contains several dialects and Modern Standard Arabic (MSA). We highlight the most important choices we made and assess how good these choices were.

INRIA a CCSD electronic archive server

Hal-Diderot

Script Independent Morphological Segmentation for Arabic Maghrebi Dialects: An Application to Machine Translation

Author: Harrat Salima
Meftouh Karima
Smaïli Kamel
Publication venue: 'Instituto Politecnico Nacional/Centro de Investigacion en Computacion'
Publication date: 01/01/2019

This research deals with resource creation for under-resourced languages. We try to adapt resources that exist for well-resourced languages to process less-resourced ones. We focus on the Arabic dialects of the Maghreb, namely Algerian, Moroccan and Tunisian. We first adapt a well-known statistical word segmenter to segment Algerian dialect texts written in both Arabic and Latin scripts. We demonstrate that unsupervised morphological segmentation can be applied to Arabic dialects regardless of the script used. Next, we use this kind of segmentation to improve statistical machine translation scores between the three Maghrebi dialects and French. We use a parallel multidialectal corpus that includes six Arabic dialects in addition to MSA and French. The results are encouraging. With regard to word segmentation, the rate of correctly segmented words reached 70% for words written in Latin script and 79% for those written in Arabic script. For machine translation, unsupervised morphological segmentation reduced out-of-vocabulary word rates by at least 35%.

INRIA a CCSD electronic archive server

Hal-Diderot

Machine Translation Experiments on PADIC: A Parallel Arabic DIalect Corpus

Author: Abbas Mourad
Harrat Salima
Jamoussi Salma
Meftouh Karima
Smaili Kamel
Publication date: 01/01/2015

Waseda University Repository

Maghrebi Arabic dialect processing: an overview

Author: Harrat Salima
Meftouh Karima
Smaïli Kamel
Publication venue: HAL CCSD
Publication date: 05/12/2017

Natural Language Processing for Arabic dialects has grown considerably in recent years, and several works have been proposed that deal with all aspects of Natural Language Processing. However, some Arabic dialect (AD) varieties have received more attention and have a growing collection of resources. Other varieties, such as Maghrebi, still lag behind in that respect. Maghrebi Arabic is the family of Arabic dialects spoken in the Maghreb region (principally Algeria, Tunisia and Morocco). In this work, we are interested in the dialects of these three countries. This paper presents a review of natural language processing for Maghrebi Arabic dialects.

INRIA a CCSD electronic archive server

Hal-Diderot

Comparative study of Arabic and french statistical language models

Author: Laskri Med Tayeb
Meftouh Karima
Smaïli Kamel
Publication venue: HAL CCSD
Publication date: 19/01/2009

In this paper, we propose a comparative study of statistical language models for Arabic and French. The objective of this study is to understand how to better model both Arabic and French. Several experiments using different smoothing techniques have been carried out. For French, trigram models are the most appropriate whatever smoothing technique is used. For Arabic, higher-order n-gram models smoothed with the Witten-Bell method are more efficient. Tests are carried out with corpora and vocabularies of comparable size.

INRIA a CCSD electronic archive server

Hal-Diderot

Arabic Statistical N-gram Models

Author: Laskri Mohamed Tayeb
Meftouh Karima
Smaïli Kamel
Publication venue: 'Praise Worthy Prize'
Publication date: 01/01/2009

In this work, we investigate statistical language models for Arabic. Several experiments using different smoothing techniques have been carried out on a small corpus extracted from a daily newspaper. The sparseness of the data leads us to investigate other solutions without increasing the size of the corpus. Word segmentation has been applied in order to increase the statistical viability of the corpus. This leads to better performance in terms of normalized perplexity.

INRIA a CCSD electronic archive server

Hal-Diderot

Arabic statistical language modeling

Author: Laskri Mohamed-Tayeb
Meftouh Karima
Smaïli Kamel
Publication venue: HAL CCSD
Publication date: 12/03/2008

In this study, we investigate statistical language models for Arabic. Several experiments using different smoothing techniques have been carried out on a small corpus extracted from a daily newspaper. The sparseness of the data leads us to investigate other solutions without increasing the size of the corpus. A word segmentation technique has been employed in order to increase the statistical viability of the corpus. This leads to better performance in terms of normalized perplexity.

INRIA a CCSD electronic archive server

Hal-Diderot

Grapheme To Phoneme Conversion - An Arabic Dialect Case

Author: Abbas Mourad
Harrat Salima
Meftouh Karima
Smaïli Kamel
Publication venue: HAL CCSD
Publication date: 14/05/2014

We aim to develop a speech translation system between Modern Standard Arabic and the Algiers dialect. Such a system must include a text-to-speech module, which itself must include a grapheme-to-phoneme converter. The Algiers dialect is an Arabic dialect affected by most of the problems that Modern Standard Arabic poses in NLP. Furthermore, it can be considered an under-resourced language because it is a vernacular language for which no substantial corpus exists. In this paper, we present a grapheme-to-phoneme converter for this language. We used a rule-based approach and a statistical approach, obtaining accuracies of 92% and 85% respectively, despite the lack of resources for this language.

INRIA a CCSD electronic archive server

Hal-Diderot

Machine Translation Experiments on PADIC: A Parallel Arabic DIalect Corpus

Author: Abbas Mourad
Harrat Salima
Jamoussi Salma
Meftouh Karima
Smaili Kamel
Publication venue: HAL CCSD
Publication date: 30/10/2015

In this paper, we present PADIC, a Parallel Arabic DIalect Corpus that we built from scratch, and the experiments we conducted on cross-dialect Arabic machine translation. PADIC is composed of dialects from both the Maghreb and the Middle East. Each dialect has been aligned with Modern Standard Arabic (MSA). Five dialects are covered by this study: three from the Maghreb (two from Algeria and one from Tunisia) and two from the Middle East (Syria and Palestine). PADIC was built from scratch because of the lack of dialect resources. In fact, Arabic dialects in the Arab world are generally used in daily conversation but are not written. To the best of our knowledge, PADIC is, up to now, the largest corpus in the community working on dialects, especially those of the Maghreb. PADIC is composed of 6400 sentences for each of the 5 dialects and MSA. We conducted cross-lingual machine translation experiments between all the language pairs. For translation into MSA, we interpolated the corresponding language model (LM) with an LM based on a large Arabic corpus. We also studied the impact of language model smoothing techniques on machine translation results because this corpus, even though it is the largest of its kind, is still very small in comparison to those used for translation of natural languages.

INRIA a CCSD electronic archive server

Hal-Diderot